![]() |
The Twofish Encryption Algorithm: A 128-Bit Block Cipher
by Bruce Schneier ; John Kelsey ; Doug Whiting ; David Wagner ; Chris Hall ; Niels Ferguson Wiley Computer Publishing, John Wiley & Sons, Inc. ISBN: 0471353817 Pub Date: 03/01/99 |
Previous | Table of Contents | Next |
We implemented four different keying options. All of our keying options pre-compute Ki for i = 0,...,39 and use 160 bytes of RAM to store these constants. The differences occur in the way the function g is implemented. There are several other possible keying options, each with slightly different setup/throughput tradeoffs, but the examples listed below are representative of the range of possibilities.
As shown in Table 5.1, the code size for a fully optimized Twofish implementation on a Pentium Pro ranges from about 8450 bytes in assembler to 14100 bytes in Borland C. In assembler, the encryption and decryption routines are each about 2250 bytes in size; the remaining 4000 bytes of code are in the key scheduling routines, but about 2200 bytes of that total can be discarded if only 128-bit keys are needed. In Borland C, the encryption and decryption routines are each about 4500 bytes in length, and the key schedule routine is slightly less than 5000 bytes. Note that each routine fits easily within the code cache of a Pentium or a Pentium Pro. These sizes are larger than Blowfish but very similar to a fully optimized assembly language version of DES. Note that, with the exception of the zero keying option, the code sizes in the table are for fully unrolled implementations; in either C or assembler, it is possible to achieve significantly smaller code sizes using loops with round counters, at a cost in performance.
In addition to the code, there are about 4600 bytes of fixed tables for the MDS matrix, q0 and q1 required for key setup, and each key requires about 4300 bytes of key-dependent data tables for the full keying option. During encryption or decryption, these tables fit easily in the Pentium data cache. The other keying options use less data and table space, as discussed above.
For machines with sufficient RAM and a good memory cache subsystem, large precomputed tables can be used to reduce the key setup time for Twofish even further. For example, in compiled, full, or partial keying modes, the first two levels of q0 and q1 lookups with one key byte can be precomputed for all four S-boxes, requiring 256 Kbytes of table (four tables of 64 Kbytes each). This approach saves roughly 2000 clocks per key setup on the Pentium Pro in assembly language; details are shown in Table 5.2. For instance, the compiled mode key setup for 128-bit keys on a Pentium Pro can be reduced from 8700 clocks to 6500 clocks. Unfortunately, the savings on Pentium and Pentium MMX CPUs seems to depend on the performance of the L2 cache subsystem (which is included in the Pentium Pro and thus is more predictable); the gain seems to range from 500 clocks down to nothing. Implementing this big table version in C also leads to savings of about 1000 clocks per key setup on the Pentium Pro, depending on the quality of the compiler; again, Pentium performance gains are minimal.
Processor | Lang | Keying Option | Code Size | Clocks to Key | Clocks to Encrypt | ||||
---|---|---|---|---|---|---|---|---|---|
128 | 192 | 256 | 128 | 192 | 256 | ||||
PPro/II | ASM | Comp. | 271,200 | 6500 | 9200 | 11900 | 258 | 258 | 258 |
PPro/II | ASM | Full | 270,600 | 5300 | 8000 | 11000 | 315 | 315 | 315 |
PPro/II | ASM | Part. | 272,900 | 2600 | 5300 | 8200 | 460 | 460 | 460 |
PPro/II | MS C | Full | 273,300 | 7300 | 11200 | 15700 | 600 | 600 | 600 |
Table 5.2. Twofish Performance with Large Fixed Tables |
Previous | Table of Contents | Next |